Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-blocking PUT in CHPL_COMM=ofi #25977

Merged
merged 25 commits into from
Oct 9, 2024
Merged

Non-blocking PUT in CHPL_COMM=ofi #25977

merged 25 commits into from
Oct 9, 2024

Conversation

jhh67
Copy link
Contributor

@jhh67 jhh67 commented Sep 23, 2024

This PR implements non-blocking PUTs in CHPL_COMM=ofi correctly. Prior to Chapel 2.2, non-blocking PUTs were implemented via blocking PUTs, and blocking PUTs would be typically be "injected". The OFI injection functionality allows the source buffer to be reused immediately, and suppresses the completion event that normally indicates that the operation is complete. As a result, both blocking and non-blocking PUTs would incorrectly be considered complete when the function returned. Because injection allows the source buffer to be reused immediately this would not cause a race on the source buffer, but it would allow PUTs to be reordered and could potentially cause hangs because the upper layers would not know that there were outstanding PUTs that might require progressing the communication endpoint. Although it doesn't appear that we saw these problems in production use, the potential existed.

This PR flips the situation around so that a blocking PUT is implemented as a non-blocking PUT followed by waiting for the PUT to complete. The lower-level communication primitives are now all non-blocking. PUTs may still be injected, but I will likely remove that functionality in a subsequent PR because 1) it doesn't make sense to inject a blocking PUT because you are going to wait for it to complete anyway, and 2) the semantics of non-blocking PUTs in Chapel are that the caller may not reuse the source buffer until the PUT completes. As a result, neither benefit from injection.

As part of implementing this PR I changed the dangling PUT logic. A "dangling PUT" one that may not be visible, and will need to be forced into visibility when required by the Chapel memory consistency model. Previously, PUTs were not allowed to dangle if transmit contexts were unbound, which led to a lot of special-case code to deal with bound and unbound transmit contexts. Now all dangling PUTs are treated the same whether or not the transmit contexts are bound, and dangling PUTs are forced into visibility when an unbound transmit context is freed. They could possibly be deferred until even later, but we expect unbound transmit contexts to be the exception so it doesn't make sense to optimize them further. This change simplifies the code without degrading performance.

@jhh67
Copy link
Contributor Author

jhh67 commented Oct 2, 2024

The following is the performance of the ra-rmo benchmark under different Chapel releases. The 2.1 release is the first with FI_ORDER_RMA_WAW erroneously enabled. The PR column is this pull request that has non-blocking PUTs implemented correctly. Prior to 2.1 non-blocking PUTs were incorrectly implemented using OFI's "inject" functionality, which provided to higher performance but had the potential to hang.

Nodes 1.32 2.0 2.1 main PR
1 0.0322198 0.0318668 0.00366552 0.0224351 0.0314711
2 0.044806 0.0460495 0.00454622 0.0306014 0.0434377
4 0.0730733 0.0737625 0.00714498 0.0502822 0.0696524
8 0.136939 0.13859 0.0143102 0.0940088 0.13141
16 0.263465 0.263353 0.027159 0.180891 0.251367
32 0.490895 0.489333 0.0547763 0.340288 0.467814
64 0.956173 0.935306 0.0992574 0.666866 0.910218

Notes:

  • chpl --fast test/release/examples/benchmarks/hpcc/ra.chpl -sverify=false -suseOn=false -sN_U="2**(n-4)"
  • ./ra-rmo --n=34 -nl ${i}x2
  • CHPL_RT_COMM_OFI_INJECT_RMA=true
  • Measured on hotlum

@jhh67 jhh67 requested a review from jabraham17 October 2, 2024 20:45
@jhh67 jhh67 marked this pull request as ready for review October 2, 2024 20:45
test/runtime/configMatters/comm/large-rma/EXECENV Outdated Show resolved Hide resolved
test/runtime/configMatters/comm/large-rma/README Outdated Show resolved Hide resolved
test/runtime/configMatters/comm/unbound/EXECENV Outdated Show resolved Hide resolved
test/runtime/configMatters/comm/unbound/README Outdated Show resolved Hide resolved
test/runtime/configMatters/comm/unbound/SKIPIF Outdated Show resolved Hide resolved
runtime/src/comm/ofi/comm-ofi.c Outdated Show resolved Hide resolved
Previously, non-blocking PUTs were implemented via blocking PUTs, which could
severely limit performance. Prior to 2.0, small PUTs invoked fi_inject_write,
which essentially turned them into non-blocking PUTs, but chpl_comm_put
returned as if the PUT was completed. This could cause MCM violations as well
as hangs caused by not progressing the network stack properly. These
deficiences were fixed in 2.0, but led to a performance regression. This
commit implements non-blocking PUTs correctly, so that the chpl_comm_*nb*
functions work correctly. This should restore 1.32.0 performance while
avoiding MCM violations and hangs.

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Rewrote PUT logic so that low-level functions are non-blocking, and a blocking
PUT is implemented by initiating a non-blocking PUT and waiting for it to
complete. This simplifies the implementation and avoids code duplication.

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Allow specifying the maximum message size and maximum number of endpoings.
These are intended primarily for testing.

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Also some code cleanup.

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
We are now using this function to force visibility when an unbound endpoint is
released, so it needs to work on unbound endpoints.

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Operations to force visibility are deferred until the endpoint is released,
which requires the visibility bitmaps.

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Fixed how the number of transmit contexts needed is computed, and added some
comments.

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Change type of numTxCtxs and numRxCtxs to size_t to match type of
info->domain_attr->ep_cnt.

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
@jhh67 jhh67 merged commit b3b7e55 into chapel-lang:main Oct 9, 2024
7 checks passed
@jhh67 jhh67 deleted the nbput branch October 9, 2024 19:13
@bradcray
Copy link
Member

bradcray commented Oct 9, 2024

@jhh67 : In your table above, would Chapel 2.2 be expected to essentially match the main column?

@jhh67
Copy link
Contributor Author

jhh67 commented Oct 10, 2024

@bradcray: Yes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants